- Session 1: Motivation, why and how to think about data, and getting started with R
- Session 2: Making basic plots, grammar of graphics, good practices
- Session 3: Advanced graphics, layering, using maps
Beijing, China - May 24-26, 2016
These are two examples of data sets that I've analysed in recent years, and from which I learned a lot by making plots.
The data can be pulled from the web, and the code that produced the plots in these slides is in the .Rmd version, so that you can reproduce this work yourself.
(Slides and material for this workshop can be found at http://dicook.github.io/China-R.)
Big thanks to Xie Yihui (谢益辉) for these tools!
"R has become the most popular language for data science and an essential tool for Finance and analytics-driven companies such as Google, Facebook, and LinkedIn." Microsoft 2015
CRAN! More than 10000 on github.com

If R were an airplane, RStudio would be the airport, providing many, many supporting services that make it easier for you, the pilot, to take off and go to awesome places. Sure, you can fly an airplane without an airport, but having those runways and supporting infrastructure is a game-changer.
(Diagram from Hadley Wickham)
.Rmd log book to contain your work

Create a project to contain all of the material covered in this set of tutorials:
R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document. R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes).
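As a rough illustration of that structure, the sketch below writes a tiny .Rmd file from R: a YAML header, a line of markdown, and one embedded R chunk. The file name, title and contents are made up for this example, and the render call is left commented out because it assumes the rmarkdown package and pandoc are installed.

```r
# Build the chunk delimiter programmatically so it can be written from R code
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"My log book\"",      # hypothetical title
  "output: html_document",
  "---",
  "",
  "Some *markdown* text describing the analysis.",
  "",
  paste0(fence, "{r}"),          # start of an embedded R chunk
  "summary(c(1, 2, 3))",         # R code that is run when the document is knit
  fence                          # end of the chunk
)

rmd_file <- file.path(tempdir(), "logbook.Rmd")  # hypothetical file name
writeLines(rmd_lines, rmd_file)

# Assumes rmarkdown and pandoc are available:
# rmarkdown::render(rmd_file)
```

Opening the resulting file in RStudio and pressing Knit should have the same effect as the commented call.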
RStudio's cheatsheet gives a nice, concise overview of its capabilities.
RStudio's reference guide lists its options.
Data can be found in R packages
data(economics, package = "ggplot2") # data frames are essentially a list of vectors
str(economics)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 574 obs. of 6 variables:
#> $ date : Date, format: "1967-07-01" "1967-08-01" ...
#> $ pce : num 507 510 516 513 518 ...
#> $ pop : int 198712 198911 199113 199311 199498 199657 199808 199920 200056 200208 ...
#> $ psavert : num 12.5 12.5 11.7 12.5 12.5 12.1 11.7 12.2 11.6 12.2 ...
#> $ uempmed : num 4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
#> $ unemploy: int 2944 2945 2958 3143 3066 3018 2878 3001 2877 2709 ...
These are not usually kept up to date but are good for practicing your analysis skills on.
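To see the "list of vectors" idea without installing anything, here is a small sketch using airquality, a data set from the datasets package that ships with R. It is my choice of example, not one used in these slides.

```r
data(airquality, package = "datasets")

is.list(airquality)        # a data frame is stored as a list ...
length(airquality)         # ... with one element per column
dim(airquality)            # number of rows and columns
sapply(airquality, class)  # each column is a vector with its own type

# Names of all data sets that come with an installed package
pkg_data <- data(package = "datasets")$results[, "Item"]
head(pkg_data)
```

Replacing "datasets" with the name of any other installed package lists the data it provides.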
Or in their own packages
library(gapminder)
str(gapminder)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
#> $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
#> $ lifeExp : num 28.8 30.3 32 34 36.1 ...
#> $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
#> $ gdpPercap: num 779 821 853 836 740 ...
More contemporary sets here, but not updated frequently.
I primarily use the readr package for reading data now. It mimics the base R reading functions but is implemented in C++, so it reads large files quickly, and it also attempts to identify the types of variables.
library(readr) # provides read_csv
library(knitr) # provides kable
ped <- read_csv("../data/Pedestrian_Counts.csv")
kable(head(ped))
| Date_Time | Sensor_ID | Sensor_Name | Hourly_Counts |
|---|---|---|---|
| 01-MAY-2009 00:00 | 4 | Town Hall (West) | 209 |
| 01-MAY-2009 00:00 | 17 | Collins Place (South) | 28 |
| 01-MAY-2009 00:00 | 18 | Collins Place (North) | 36 |
| 01-MAY-2009 00:00 | 16 | Australia on Collins | 22 |
| 01-MAY-2009 00:00 | 2 | Bourke Street Mall (South) | 52 |
| 01-MAY-2009 00:00 | 1 | Bourke Street Mall (North) | 53 |
Pulling data together yourself, or compiled by someone else.
Read the documentation for the economics data in the ggplot2 package. Can you think of questions you could answer using these variables?
Write these into your .Rmd file.
Read the documentation for gapminder data. Can you think of questions you could answer using these variables?
Write these into your .Rmd file.
Read the documentation for pedestrian sensor data. Can you think of questions you could answer using these variables?
Write these into your .Rmd file.
- <- is called "gets"
- The n_max=50 option to the read_csv function reads just the first 50 lines
- dim reports the dimensions of the data matrix
- colnames shows the column names (you can see these by looking at the object in the RStudio environment window, too)
- $ specifies the column to use
- typeof indicates the information format in the column, that is, what R thinks it is
- Names containing spaces need backticks, as in workers$`Claim Type`
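A minimal sketch of these ideas on a tiny table. The workers data are not included here, so the values below are invented purely for illustration, and base R's read.csv with nrows stands in for read_csv with n_max so that the example runs without extra packages.

```r
# Hypothetical stand-in for a real file; the values are invented
csv_text <- "Claim Type,Amount
Injury,120
Property,300
Injury,80"

# nrows = 2 plays the role of n_max = 2 in readr::read_csv()
workers <- read.csv(text = csv_text, nrows = 2,
                    check.names = FALSE, stringsAsFactors = FALSE)

dim(workers)               # number of rows and columns
colnames(workers)          # column names, including one with a space
workers$`Claim Type`       # backticks are needed because of the space
typeof(workers$Amount)     # how R stores the column
```

Reading only a few rows first is a cheap way to check column names and types before loading a large file.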
- lists are heterogeneous (elements can have different types)
- data.frames are heterogeneous but elements have the same length
- vectors and matrices are homogeneous (elements have the same type), which is why c(1, "2") ends up being a character vector
- functions can be written to save repeating code again and again
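The differences between these structures are easy to check directly. The objects below are small made-up examples, and col_range is a hypothetical helper written just for this sketch.

```r
v <- c(1, "2")                 # a vector is homogeneous, so 1 is coerced to "1"
typeof(v)

l <- list(1, "2")              # a list keeps each element's own type
sapply(l, typeof)

df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
sapply(df, typeof)             # columns differ in type ...
sapply(df, length)             # ... but all have the same length

# A small function saves repeating the same code for every column
col_range <- function(x) max(x) - min(x)
col_range(df$id)
col_range(c(2.5, 10, 4))
```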
If you'd like to know more, see Hadley Wickham's online chapters on data structures and subsetting
set.seed(1000)
x <- rnorm(6)
x
#> [1] -0.446 -1.206 0.041 0.639 -0.787 -0.385
sum(x + 10)
#> [1] 58
R has rich support for documentation, see ?sum

Use [ to extract elements of a vector.

x[1]
#> [1] -0.45
x[c(T, F, T, T, F, F)]
#> [1] -0.446 0.041 0.639
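Beyond positions and logical vectors, a condition can be used inside [ directly. This sketch types in the rounded values printed above rather than regenerating them, so it does not depend on the random number generator.

```r
vals <- c(-0.446, -1.206, 0.041, 0.639, -0.787, -0.385)

vals[2:3]          # a range of positions
vals[-1]           # a negative index drops that position
vals > 0           # a comparison returns a logical vector ...
vals[vals > 0]     # ... which can be used to subset
which(vals > 0)    # positions where the condition holds
```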
Use $, [[, and/or [ to extract elements of a list.

x <- list(
  a = 10,
  b = c(1, "2")
)
x$a
#> [1] 10
x[["a"]]
#> [1] 10
x["a"]
#> $a
#> [1] 10
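The three operators do not return the same kind of object, which is a common source of confusion. A small sketch with a made-up list:

```r
lst <- list(a = 10, b = c(1, "2"))

class(lst[["a"]])   # [[ and $ return the element itself
class(lst["a"])     # [ returns a smaller list containing the element

lst[["a"]] + 1      # works: the element is numeric
# lst["a"] + 1      # would fail: arithmetic is not defined for a list

name <- "b"
lst[[name]]         # [[ accepts a name stored in a variable; $ does not
```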
str() is a very useful R function. It shows you the "structure" of (almost) any R object (and everything in R is an object!!!)

str(x)
#> List of 2
#> $ a: num 10
#> $ b: chr [1:2] "1" "2"
NA is the indicator of a missing value in R

x <- c(50, 12, NA, 20)
mean(x)
#> [1] NA
mean(x, na.rm=TRUE)
#> [1] 27
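Before dropping missing values it helps to know how many there are. A short sketch with the same values, using only base R:

```r
y <- c(50, 12, NA, 20)

is.na(y)                 # TRUE where a value is missing
sum(is.na(y))            # number of missing values
mean(y[!is.na(y)])       # same result as mean(y, na.rm = TRUE)

NA == NA                 # comparisons with NA give NA, so use is.na() instead
```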
table function can be used to tabulate numbers

table(ped$Sensor_Name)
#>
#> Alfred Place
#> 12365
#> Australia on Collins
#> 48310
#> Birrarung Marr
#> 44904
#> Bourke St-Russel St (West)
#> 18573
#> Bourke St-Russell St (West)
#> 2208
#> Bourke Street Mall (North)
#> 47254
#> Bourke Street Mall (South)
#> 59205
#> Chinatown-Lt Bourke St (South)
#> 20911
#> Chinatown-Swanston St (North)
#> 19797
#> City Square
#> 17860
#> Collins Place (North)
#> 59205
#> Collins Place (South)
#> 54886
#> Flagstaff Station
#> 59205
#> Flinders St-Elizabeth St (East)
#> 19917
#> Flinders St-Spark La
#> 14470
#> Flinders St-Spring St (West)
#> 15574
#> Flinders St-Swanston St (West)
#> 9503
#> Flinders Street Station Underpass
#> 59205
#> Grattan St-Swanston St (West)
#> 6191
#> Lonsdale St (South)
#> 2208
#> Lonsdale St-Spring St (West)
#> 9047
#> Lonsdale Street (South)
#> 16630
#> Lygon St (East)
#> 8759
#> Lygon St (West)
#> 2208
#> Lygon Street (West)
#> 18070
#> Melbourne Central
#> 58773
#> Melbourne Convention Exhibition Centre
#> 20781
#> Monash Rd-Swanston St (West)
#> 7006
#> New Quay
#> 56278
#> Princes Bridge
#> 59205
#> Queen St (West)
#> 8144
#> QV Market-Elizabeth St (West)
#> 20542
#> QV Market-Peel St
#> 21189
#> Sandridge Bridge
#> 59157
#> Southern Cross Station
#> 59205
#> Spencer St-Collins St (North)
#> 20039
#> Spencer St-Collins St (South)
#> 20637
#> St Kilda-Alexandra Gardens
#> 18886
#> State Library
#> 53616
#> The Arts Centre
#> 20311
#> Tin Alley-Swanston St (West)
#> 7004
#> Town Hall (West)
#> 58869
#> Victoria Point
#> 58773
#> Waterfront City
#> 59205
#> Webb Bridge
#> 58533
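Tabulating is also a quick way to spot data entry problems: in the output above, "Bourke St-Russel St (West)" and "Bourke St-Russell St (West)" are counted as separate sensors. A sketch on a short made-up vector (the real data are not loaded here), sorting the counts so the largest come first:

```r
# Hypothetical sensor names; note the two spellings of the same location
sensor <- c("Town Hall", "Town Hall", "Town Hall",
            "Russel St", "Russell St", "Russell St")

counts <- table(sensor)
sorted_counts <- sort(counts, decreasing = TRUE)   # most frequent first
sorted_counts

names(counts)                     # distinct values, useful for spotting variants
length(counts)                    # number of distinct values
```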
+ is a function (which calls compiled C code)

`+`
#> function (e1, e2) .Primitive("+")
"+" <- function(x, y) "I forgot how to add"
1 + 2
#> [1] "I forgot how to add"
rm("+")
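Because operators are ordinary functions, they can be called by name or handed to other functions. A brief sketch, assuming + has its usual definition (that is, after the rm call above):

```r
three   <- `+`(1, 2)                # the same as 1 + 2
shifted <- sapply(1:3, `+`, 10)     # add 10 to each element
total   <- Reduce(`+`, 1:4)         # fold the operator: ((1 + 2) + 3) + 4

three
shifted
total
```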
Reading documentation only gets you so far. What about finding function(s) and/or package(s) to help solve a problem???
Google! (I usually prefix "CRAN" to my search; others might suggest http://www.rseek.org/)
Ask your question on a relevant StackExchange outlet such as http://stackoverflow.com/ or http://stats.stackexchange.com/
It's becoming more and more popular to bundle "vignettes" with a package (dplyr has awesome vignettes)
browseVignettes("dplyr")
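Some searching can also be done from inside R. The functions below all come with a standard R installation; the search terms are only examples.

```r
# Objects on the search path whose names start with "read."
read_fns <- apropos("^read\\.")
head(read_fns)

# These open help pages or search results, so they are shown commented out:
# help.search("linear model")     # search installed documentation (same as ??"linear model")
# vignette(package = "dplyr")     # list the vignettes of an installed package
# example(mean)                   # run the examples from a help page
```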
This is a current project (joint with Ben Marwick, Rob Hyndman, Heike Hofmann, Carson Sievert, Nathaniel Tomasetti). Code and data are provided to study the electoral maps and system.
There is a shiny app that facilitates interactive exploration of the data.
Notes prepared by Di Cook, building on joint workshops with Carson Sievert, Heike Hofmann, Eric Hare, Hadley Wickham.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.